2015

Heterosis

High-parent heterosis

Low-parent heterosis

Gene-expression heterosis


  • Maize data taken from a Genome Research study (Paschold et al. 2012).
  • 4 genetic lines available:
    • B73 (parent1)
    • Mo17 (parent2)
    • B73xMo17 (hybrid)
    • Mo17xB73 (alternate hybrid, discarded)

The model

Parameterization

  • \(N = 12\) libraries, index \(n\).
  • \(G = 39656\) genes, index \(g\).
  • Log-scale expression levels \(\mu_{g, \ \text{genotype}}\)
  • \(\eta(n, g) = \mu_{g, \ \text{genotype of }n}\)
  • \[ \begin{align*} \mu_{g, \ \text{parent 1}} &= \phi_g + \alpha_g - \delta_g \\ \mu_{g, \ \text{parent 2}} &= \phi_g - \alpha_g + \delta_g \\ \mu_{g, \ \text{hybrid}} \ \ &= \phi_g + \alpha_g + \delta_g \\ \\ \end{align*} \]
  • \[ \begin{align*} \phi_g &= \frac{\mu_{g, \ \text{parent 1}} + \mu_{g, \ \text{parent 2}}}{2} \\ \alpha_g &= \frac{\mu_{g, \ \text{hybrid }} - \mu_{g, \ \text{parent 2}}}{2} \\ \delta_g &= \frac{\mu_{g, \ \text{hybrid }} - \mu_{g, \ \text{parent 1}}}{2} \end{align*} \]

Heterosis defined

\[\begin{align*} \\ &\text{HPH gene $g$ if} \ &&\mu_{g, \ \text{hybrid}} > \max \left ( \mu_{g, \ \text{parent 1}}, \ \mu_{g, \ \text{parent 2}}\right ) \\ &\text{LPH if} &&\mu_{g, \ \text{hybrid}} < \min \left ( \mu_{g, \ \text{parent 1}}, \ \mu_{g, \ \text{parent 2}}\right ) \\ \\ \text{Equ}&\text{ivalently,} \\ \\ &\text{HPH if} &&\alpha_g + \delta_g > \ \ \ |\alpha_g - \delta_g| \\ &\text{LPH if} &&\alpha_g + \delta_g < -|\alpha_g - \delta_g| \\ \end{align*}\]

Fitting the model with Markov chain Monte Carlo

Slice sampling within Gibbs MCMC

  • Data \(\boldsymbol{y}\), parameters \(\boldsymbol{\theta}\).
  • Sample \(\boldsymbol{\theta}\) from \(p(\boldsymbol{\theta} \ | \ \boldsymbol{y})\) by iterating over the Gibbs steps.

    • Sample \(\nu_\rho\) from its full conditional density \(p(\nu_\rho \ | \ \cdots)\) with slice sampling.
    • \(\nu_\gamma\) similar
    • \(\varepsilon_{n, g}\)'s similar
    • \(\rho_n\)'s similar
    • \(\phi_g\)'s similar
    • \(\alpha_g\)'s similar
    • \(\delta_g\)'s similar

Gibbs steps continued

  • Sample \(\tau_\rho^2\) from its full conditional gamma density with slice sampling.
  • \(\gamma_g^2\) similar, but from an inverse-gamma
  • \(\sigma_\phi^2\) similar, but from a truncated inverse-gamma
  • \(\sigma_\alpha^2\) similar
  • \(\sigma_\delta^2\) similar
  • Sample \(\theta_\phi\) directly from its full conditional normal density.
  • \(\theta_\alpha\) similar
  • \(\theta_\delta\) similar

Stopping criterion

  • Run a pilot MCMC chain for a fixed number of iterations.
  • Initialize 3 additional chains using dispersed starting values relative to the pilot chain MCMC samples.
  • Continue the 4 chains independently until all Gelman-Rubin potential scale reduction factors for "parameters of interest" are less than 1.1 (Gelman, Andrew et al. (2013), Gelman and Shirley (2011)).
    • "Parameters of interest": hyperparameters, \(\alpha_g\)'s, \(\delta_g\)'s.
  • After apparent convergence, combine the four chains and resume the MCMC until 100 effective samples are obtained (Gelman, Andrew et al. 2013) for each hyperparameter.

Inference

  • \(M\) Monte Carlo iterations, index \(m\).
  • \(\alpha_g^{(m)}\) and \(\delta_g^{(m)}\) are Monte Carlo samples.
  • \[ \begin{align*} p\left( \text{$g$ is HPH} \ | \ \boldsymbol{y} \right) &\approx \frac{1}{M} \sum_{m = 1}^M I\left (\alpha_g^{(m)} + \delta_g^{(m)} > \ \ \ \left |\alpha_g^{(m)} - \delta_g^{(m)} \right | \right) \\ p\left( \text{$g$ is LPH} \ | \ \boldsymbol{y} \right) &\approx \frac{1}{M} \sum_{m = 1}^M I\left (\alpha_g^{(m)} + \delta_g^{(m)} < -\left |\alpha_g^{(m)} - \delta_g^{(m)} \right | \right) \\ \end{align*} \]

Effect size

  • \[ \begin{align*} \widehat{\alpha}_g = \frac{1}{M} \sum_{m = 1}^M \alpha_g^{(m)} \qquad \qquad \widehat{\delta}_g = \frac{1}{M} \sum_{m = 1}^M \delta_g^{(m)} \\ \\ \end{align*} \]
  • \[ \begin{align*} \text{Effect size}_g &\approx \left (\widehat{\alpha}_g + \widehat{\delta}_g - \left |\widehat{\alpha}_g - \widehat{\delta}_g \right | \right)^+ \\ &+ \left (\widehat{\alpha}_g + \widehat{\delta}_g + \left |\widehat{\alpha}_g - \widehat{\delta}_g \right | \right)^- \end{align*} \]

Acceleration with CUDA GPUs

Parallel computation across genes

  1. Embarrassingly parallel Gibbs steps for gene-specific parameters.
  2. Parallel reductions for the rest.

1. Embarrassingly parallel Gibbs steps for gene-specific parameters

Sample the \(\gamma_g^2\)'s in parallel

  • \[ \begin{align*} \\ p(\gamma_g^2 | \cdots ) &= \text{Inverse-Gamma} \left ( \frac{N + \nu_\gamma}{2}, \ \frac{1}{2} \left (\nu_\gamma + \sum_{n =1}^N \frac{\varepsilon_{n, g}^2}{\rho_n^2} \right ) \right ) \\ \\ \end{align*} \]

  • \(G = 39656\) parallel GPU threads, one for each \(\gamma_g^2\).

Even more parallelism for the \(\varepsilon_{n, g}\)'s

  • \[ \begin{align*} p(\varepsilon_{n, g} \ | \ \cdots) &\propto \exp \left ( \varepsilon_{n, g}y_{n, g} - \frac{\varepsilon_{n, g}^2}{2 \rho_n^2 \gamma_g^2} - \exp(\varepsilon_{n, g}) \exp(\eta(n, g)) \right) \\ \\ \end{align*} \]

  • \[ \begin{align*} A &= y_{n, g} \\ B &= \frac{1}{2 \rho_n^2 \gamma_g^2} \\ C &= 0 \\ D &= \exp(\eta(n, g)) \\ E &= 0 \\ \\ \\ \end{align*} \]

  • \(N \times G = 12 \times 39656\) threads, one for each \(\varepsilon_{n, g}\).

2. Parallel reductions for non-gene-specific parameters

The pairwise sum: a simple reduction

The pairwise sum: a simple reduction

The pairwise sum: a simple reduction

The pairwise sum: a simple reduction

\[ \begin{align*} p(\theta_\phi \mid \cdots) &= \text{Normal} \left ( \frac{ c_\phi^2 \sum_{g = 1}^G \phi_g}{G c_\phi^2 + \sigma_\phi^2}, \ \frac{c_\phi^2 \sigma_\phi^2}{Gc_\phi^2 + \sigma_\phi^2} \right ) \\ p(\theta_\alpha \mid \cdots) &= \text{Normal} \left ( \frac{c_\alpha^2 \sum_{g = 1}^G \alpha_g}{G c_\alpha^2 + \sigma_\alpha^2}, \ \frac{c_\alpha^2 \sigma_\alpha^2}{G c_\alpha^2 + \sigma_\alpha^2} \right ) \\ p(\theta_\delta \mid \cdots) &= \text{Normal} \left (\frac{c_\delta^2 \sum_{g = 1}^G \delta_g}{G c_\delta^2 + \sigma_\delta^2}, \ \frac{c_\delta^2 \sigma_\delta^2}{G c_\delta^2 + \sigma_\delta^2} \right ) \\ \end{align*} \]

\[ \begin{align*} p(\tau_\rho^2 \mid \cdots) &= \text{Gamma} \left (\text{shape} = a_\rho + \frac{N\nu_\rho}{2}, \ \text{rate} = b_\rho + \frac{\nu_\rho}{2} \sum_{n = 1}^N \frac{1}{\rho_n^2} \right ) \\ p(\rho_n^2 | \cdots ) &= \text{Inverse-Gamma} \left ( \frac{G + \nu_\rho}{2}, \ \frac{1}{2} \left (\nu_\rho \tau_\rho^2 + \sum_{g =1}^G \frac{\varepsilon_{n, g}^2}{\gamma_g^2} \right ) \right ) \\ p(\sigma_\phi^2 \ | \ \cdots) &= \text{Inverse-Gamma} \left ( \frac{G - 1}{2}, \ \frac{1}{2} \sum_{g = 1}^G (\phi_g - \theta_\phi)^2 \right ) \text{I} (\sigma_\phi^2 < s_\phi^2) \\ p(\sigma_\alpha^2 \ | \ \cdots) &= \text{Inverse-Gamma} \left ( \frac{G - 1}{2}, \ \frac{1}{2} \sum_{g = 1}^G (\alpha_g - \theta_\alpha)^2 \right ) \text{I} (\sigma_\alpha^2 < s_\alpha^2) \\ p(\sigma_\delta^2 \ | \ \cdots) &= \text{Inverse-Gamma} \left (\frac{G - 1}{2}, \ \frac{1}{2} \sum_{g = 1}^G (\delta_g - \theta_\delta)^2 \right ) \text{I} (\sigma_\delta^2 < s_\delta^2) \end{align*} \]

\[ \begin{align*} p(\nu_\gamma \mid \cdots) \propto \exp &\left ( - \log \Gamma \left ( \frac{\nu_\gamma}{2} \right) + \frac{\nu_\gamma}{2} \log \left ( \frac{\nu_\gamma}{2} \right ) \right . \\ & \left . \ \ - \nu_\gamma \frac{1}{G} \sum_{g = 1}^G \left [ \log \gamma_g + \frac{1}{2} \frac{1}{\gamma_g^2} \right ] \right )^G \times I(0 < \nu_\gamma < d_\gamma) \\ \\ \end{align*} \]

  • \[ \begin{align*} &\frac{d^2}{d \nu_\gamma^2} \log p(\nu_\gamma \mid \cdots) = - \frac{G}{4} \psi^{(1)} \left ( \frac{\nu_\gamma}{2} \right) + \frac{G}{2 \nu_\gamma} \qquad (0 < \nu_\gamma < d_\gamma) \\ \\ \end{align*} \]

  • Applying Qi and Guo (2013) Lemma 1 to the trigamma function \(\psi^{(1)}\) shows \[ \begin{align*} \frac{d^2}{d \nu_\gamma^2} \log p(\nu_\gamma \mid \cdots) < 0 \qquad (0 < \nu_\gamma < d_\gamma) \end{align*} \]

\[ \begin{align*} p(\nu_\rho \mid \cdots) \propto \exp & \left ( - \log \Gamma \left ( \frac{\nu_\rho}{2} \right) + \frac{\nu_\rho}{2} \log \left ( \frac{\nu_\rho \tau_\rho^2}{2} \right ) \right . \\ & \left . \ \ - \nu_\rho \frac{1}{N} \sum_{n = 1}^N \left [ \log \rho_n + \frac{\tau_\rho^2}{2} \frac{1}{\rho_n^2} \right ] \right )^N \times I(0 < \nu_\rho < d_\rho) \\ \\ \end{align*} \]

  • Similarly, \[ \begin{align*} \frac{d^2}{d \nu_\rho^2} \log p(\nu_\rho \mid \cdots) < 0 \qquad (0 < \nu_\rho < d_\rho) \end{align*} \]

Results

Red points: 7 genes with Gelman factors slightly above 1.1 for \(\alpha_g\) or \(\delta_g\).

Data generated from the model

\[\begin{align*} N &= 12 \qquad \qquad &&\nu_\rho = 5 \\ G &= 35,000 \qquad \qquad &&\nu_\gamma = 5 \\ & &&\tau_\rho = 0.1 \\ \\ \theta_\phi &= 3 \qquad \qquad && \sigma_\phi = 1 \\ \theta_\alpha &= 0 \qquad \qquad && \sigma_\alpha = 0.5 \\ \theta_\delta &= 0 \qquad \qquad && \sigma_\delta = 0.5 \end{align*}\]

Thesis outline

Paper 1: a computational study

  • Lay out a general strategy for accelerating Markov chain Monte Carlo algorithms with parallel computation.
  • Use the heterosis model as a case study.
    • Full conditionals of gene-specific parameters have general forms that can be slice-sampled in parallel.
    • Other full conditionals depend on sums and products that parallel reductions can compute quickly.
    • Carry out a proof of concept using data simulated from the model.

Paper 2: a case study

  • Introduce an empirical Bayes counterpart method.
  • Is the fully Bayesian approach really better?
    • Paschold data results
    • Heterosis gene detection power
    • False discovery rate control
    • Runtime

Paper 3: different priors

  • The model may improve with Laplace, Student t, or horseshoe prior distributions on \(\phi_g\), \(\alpha_g\), and \(\delta_g\).
  • Also borrow outside methods of analyzing heterosis data and compare as in paper 2
    • edgeR (Robinson, MD and McCarthy, DJ and Smyth, GK 2010)
    • baySeq (Hardcastle 2012)
    • Empirical Bayes for microarray (Ji, Tieming and Liu, Peng, and Nettleton, Dan 2014)
    • Empirical Bayes for RNA-seq, different model (Niemi, Jarad and Mittman, Eric and Landau, Will and Nettleton, Dan submitted 2015)

References

Gelman, Andrew, and Kenneth Shirley. 2011. “Handbook of Markov Chain Monte Carlo.” In. Chapman and Hall/CRC.

Gelman, Andrew et al. 2013. “Bayesian Data Analysis.” In, 3rd ed. CRC Press.

Hardcastle, Thomas J. 2012. BaySeq: Empirical Bayesian Analysis of Patterns of Differential Expression in Count Data.

Ji, Tieming and Liu, Peng, and Nettleton, Dan. 2014. “Estimation and Testing of Gene Expression Heterosis.” Jounral of Agricultural, Biological, and Environmental Statistics 19 (3): 319–37.

Niemi, Jarad and Mittman, Eric and Landau, Will and Nettleton, Dan. submitted 2015. “Empirical Bayes analysis of RNA-seq data for detection of gene expression heterosis.” Jounral of Agricultural, Biological, and Environmental Statistics.

Paschold, Anja, Yi Jia, Caroline Marcon, Steve Lund, Nick B Larson, Cheng-Ting Yeh, Stephan Ossowski, et al. 2012. “Complementation Contributes to Transcriptome Complexity in Maize (Zea Mays L.) Hybrids Relative to Their Inbred Parents.” Genome Research 22 (12). Cold Spring Harbor Lab: 2445–54.

Qi, Feng, and Bai-Ni Guo. 2013. “Refinements of lower bounds for polygamma functions.” Proceedings of the American Mathematical Society 141 (3): 1007–15.

Robinson, MD and McCarthy, DJ and Smyth, GK. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26: 139–40.